ABSTRACT

OperationARIES! is an Intelligent Tutoring System that teaches 
scientific inquiry skills in a game-like atmosphere. Students 
complete three different training modules, each with natural 
language conversations, in order to acquire deep-level 
knowledge of 21 core concepts of research methodology (e.g., 
correlation does not mean causation). The student first acquires 
basic declarative knowledge and then applies the knowledge by 
critiquing case studies on scientific methodology and finally 
generating questions that reflect the core topics. A study using a 
pretest-training-posttest design was conducted in which 46 
college students interacted with the modules of 
OperationARIES!, resulting in thousands of logged measures. 
The goal of this investigation was to discover the different 
trajectories of learning within 11 of these core concepts by
evaluating 3 main constructs (i.e., discrimination, generation,
and time on task) represented by key logged measures. Different 
constructs showed relationships with specific core concepts. 
Three core concepts were analyzed with stepwise regression and 
5-fold cross-validation in order to discover contributing factors 
to learning gains for these core concepts. 
1. INTRODUCTION

Social scientists often emphasize differences among 
students in their analyses of learning. The present research 
acknowledges such differences among students and 
aptitude-treatment interactions [1]. However, the salient
message in this study puts the magnifying glass on 
differences between core concepts in a subject matter. 
Simply put, the learning trajectories of core concepts may
differ substantially depending on their content,
complexity, and difficulty.

1.1 Cognitive Constructs Predicting Learning

The cognitive and learning sciences have identified 
principles of learning that offer likely hypotheses 
regarding differences in learning trajectories for core 
concepts [2]. Some concepts are learned by simply
spending time reading and studying the material, a factor 
called time on task [3]. Time on task is normally 
optimized when concepts are presented on multiple 
occasions and distributed over time rather than 
concentrated in one time block [2, 4, 5]. Some concepts 
are learned primarily by actively generating the associated 
information about the concepts [2,4], particularly 
explanations [2, 5, 6, 7, 8]. Some concepts are best 
learned by testing experiences [9] and feedback on their
answers [10], whereas others are best learned by either 
tutorial interaction [8, 11, 12, 13], scaffolding to get the 
student to generate good questions about difficult 
conceptualizations [14, 15], or tasks to get the student to 
make important discriminations among alternatives [8,
11, 13, 14]. The present study investigates the training
events and experiences that contribute to the acquisition 
of critical core concepts. Our central point is simple. Core 
concepts have idiosyncratic characteristics that lend 
themselves to particular learning activities that optimize 
their acquisition. 

The goal of this investigation is to discover the cognitive 
factors that predict the learning of core concepts in 
research methodology. The concepts range from concrete 
to abstract topics [8, 11, 14] and may require the student 
to utilize different skills. For example, understanding the 
meaning of an operational definition may be quite shallow 
in nature and possibly only require more time on task. 
Conversely, a more challenging abstract topic such as 
correlation vs. causation may not be mastered by simply
memorizing a definition but rather by higher level 
reasoning, discrimination among similar constructs, and 
generating ideas or questions. The learning environment is 
a serious game called Operation ARIES, as described next 
[12]. Although we have considered thousands of measures 
collected during 20 hours of training in ARIES, our 
analyses converged on three broad time-honored 
constructs in the cognitive and learning sciences: time on 
task, discrimination, and generation. 

1.2 Operation ARIES: A Serious Game 

OperationARIES! (called ARIES for short, an acronym
meaning Acquiring Research Investigative and Evaluative 
Skills) is an Intelligent Tutoring System that has an 
embedded storyline and game-like elements to engage 
students as they learn research methodology. The 
narrative includes alien invaders who have come to take 
over the world by presenting bad science. The student 
player joins forces with the Federal Bureau of Science in 
order to save the world from this threat. The storyline and 
the iterative presentation of these topics are presented to 
the students across three specific ARIES modules (i.e., 
Training module, Case Study module, and Interrogation 
module), each focusing on different types of knowledge 
acquisition: didactic knowledge, application, and question 
generation. The learner interacts in natural language 
conversations with multiple artificial agents in order to 
learn 21 core concepts of research methodology. 

In the Training module students learn didactic knowledge 
by reading an E-text, answering multiple choice questions, 
and having dynamic tutorial conversations with two 
pedagogical agents about the 21 core concepts. In the 
Case Study module, students apply the knowledge by 
conversing with three artificial agents while identifying 
flaws in research cases with the aid of both a list of 12 
potential flaws and the E-book. Finally, in the 
Interrogation module, students pose questions to an 
artificial agent in order to decide if the research case is 
sound. The learner is aided by a score-card which 
provides immediate feedback as well as suggested 
questions. The flaws covered in the Case Study module 
and Interrogation module are aligned with the core 
concepts in the Training module. 

This paper explores the specific cognitive activities in this 
serious game that predict learning of a subset of the 21 
core concepts. These cognitive activities are part of the 
Training, Case Study, and Interrogation modules. 

2. METHODS 

The participants were 46 students at 2 separate schools in 
Southern California. There was a pretest-training-posttest
design, with two versions of a test that were 
counterbalanced between pretest and posttest. All of the 
students were enrolled in research methodology courses 
taught by the same instructor. The pretest and posttest
consisted of open-ended and multiple-choice questions
about the 21 core concepts. The participants interacted 
with the Training module in pairs, alternating between 
actively typing into the system and passively observing 
their human partner interacting (a difference that was not 
analyzed in this study). The participants intermittently 
answered survey questions about the storyline and tutorial 
conversations, but these measures are not investigated in 
the current study. The alternation between partners as well 
as the surveys did not occur in the latter two modules 
(Case Study and Interrogation). 

2.1 Measures 

The log files of ARIES had thousands of measures
including fine-grained measures for each module. 
Measures include latency measures, string variables and 
virtually every aspect of the typed interaction. With so 
many variables, the focus of this particular investigation 
will be on those measures that funnel into the three 
constructs of time on task, generation, and discrimination. 

Each of the 3 constructs was represented by a unique 
indicator for each module. Specifically, time on task was 
represented in the Training module by reading times per 
page in the E-Text, whereas the time spent on cases was 
the measure for the Case Study and Interrogation modules. 
In order to assess generation, the measures consisted of 
the number of words articulated by the student in 
conversational turns for each module. Discrimination 
scores were collected in each module. The Training 
module used the multiple-choice performance scores (0 to 
1). In the Case Study module, a discrimination score was 
calculated by subtracting the proportion of false alarms 
from hits as reflected by the match scores of the language 
processing algorithms within the system. The
Interrogation module also used signal detection
components derived from student performance on the 
score-card that discriminated whether a flaw was or was 
not present in a study. 
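The hits-minus-false-alarms score described above can be sketched as follows. This is only an illustration: the function name and the count-based inputs are assumptions, whereas the actual system derived hits and false alarms from the match scores of its language processing algorithms.

```python
def discrimination_score(hits, misses, false_alarms, correct_rejections):
    """Return the hit rate minus the false-alarm rate (range -1 to 1).

    The four counts are hypothetical inputs for one student:
    hits and misses refer to flaws that were present,
    false alarms and correct rejections to flaws that were absent.
    """
    flaw_present = hits + misses
    flaw_absent = false_alarms + correct_rejections
    if flaw_present == 0 or flaw_absent == 0:
        raise ValueError("need at least one flaw-present and one flaw-absent item")
    hit_rate = hits / flaw_present
    false_alarm_rate = false_alarms / flaw_absent
    return hit_rate - false_alarm_rate


# Example with made-up counts: 8 of 10 flaws detected, 1 of 10 false alarms.
score = discrimination_score(hits=8, misses=2, false_alarms=1, correct_rejections=9)
```

A score near 1 indicates that flaws were flagged when present and rarely flagged when absent, while a score near 0 indicates no discrimination.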

In order to measure learning gains, we computed 
proportion scores for the pretest and posttest. Each test 
consisted of a multiple-choice and short-answer question 
corresponding to each of the 21 concepts. Proportional 
learning gains scores [(posttest - pretest)/(1 - pretest)] were
calculated in order to adjust for the variation of prior
knowledge across the students. These scores were
available for each of the 21 concepts. 
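The proportional learning gain formula can be written as a short function. This is a minimal sketch assuming pretest and posttest are proportion scores between 0 and 1. Returning None for a perfect pretest is an assumption made here, since the formula is undefined in that case and the paper does not say how it was handled.

```python
def proportional_gain(pretest, posttest):
    """Return (posttest - pretest) / (1 - pretest) for proportion scores."""
    if not (0.0 <= pretest <= 1.0 and 0.0 <= posttest <= 1.0):
        raise ValueError("scores must be proportions between 0 and 1")
    if pretest == 1.0:
        # Assumption: no room for improvement, so the gain is left undefined.
        return None
    return (posttest - pretest) / (1.0 - pretest)


# A student who moves from 0.50 to 0.75 closes half of the remaining gap.
gain = proportional_gain(0.50, 0.75)
```

Dividing by (1 - pretest) expresses the improvement relative to how much the student could still learn, which is why the measure adjusts for prior knowledge.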

3. ANALYSES 

Although this original dataset consisted of 46 participants, 
10 of the subjects were removed due to extensive amounts 
of missing data (i.e., usually more than one module). For
the remaining 36 students, mean values were used to
replace the missing data for discrimination scores.
However, missing time on task and generation scores were
simply left as 0's. The most complete set of original data,
prior to mean replacement, was available for 11 core concepts.
These core concepts were presented and tested across all
three modules, so they were selected for the subsequent
analyses.
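The treatment of missing values described above can be sketched as follows. The record layout (a list of dictionaries with None marking a missing value) and the field names are hypothetical, chosen only to illustrate mean replacement for one group of measures and zero filling for another.

```python
def fill_missing(records, mean_fields, zero_fields):
    """Return copies of the records with missing (None) values filled in.

    Fields in mean_fields receive the mean of the observed values;
    fields in zero_fields receive 0.
    """
    means = {}
    for field in mean_fields:
        observed = [r[field] for r in records if r[field] is not None]
        if not observed:
            raise ValueError(f"no observed values for {field!r}")
        means[field] = sum(observed) / len(observed)

    filled = []
    for record in records:
        row = dict(record)  # copy so the input is not modified
        for field in mean_fields:
            if row[field] is None:
                row[field] = means[field]
        for field in zero_fields:
            if row[field] is None:
                row[field] = 0
        filled.append(row)
    return filled


# Made-up data for three students; the second has missing values.
students = [
    {"discrimination": 0.4, "words": 10},
    {"discrimination": None, "words": None},
    {"discrimination": 0.8, "words": 30},
]
completed = fill_missing(students, ["discrimination"], ["words"])
```

Mean replacement keeps the sample mean unchanged but shrinks the variance of the filled measure, whereas zero filling treats a missing value as no recorded time or words.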

3.1 Correlations 

The proportional learning gain scores ranged from .17 
(Causal Claims) to .50 (Subject Bias), with a mean of .34 
over the 11 core concepts. We computed correlations 
between these gain scores and the training process 
measures. We found a number of significant correlations, 
but the more important conclusion is that the profile of 
process-to-learning correlations differed greatly among
core concepts. 
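A correlation profile of this kind can be computed for one core concept as sketched below. The sketch assumes Pearson correlations and uses hypothetical measure names and made-up values; it does not compute significance levels.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def correlation_profile(gains, measures):
    """Correlate each process measure with the learning gains of one concept."""
    return {name: pearson_r(values, gains) for name, values in measures.items()}


# Made-up values for four students on one core concept.
profile = correlation_profile(
    gains=[0.1, 0.3, 0.4, 0.6],
    measures={
        "training_reading_time": [20, 35, 40, 60],
        "interrogation_words": [50, 10, 40, 20],
    },
)
```

Comparing such profiles across core concepts is what reveals whether different concepts depend on different modules and constructs.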

It is beyond the scope of this report to present the full set 
of data. Instead, we will focus on a few core concepts that 
illustrate the differences. For example, the Training
module dominated the learning of one core concept
(Objective Scoring of the Dependent Variable), with
significantly positive correlations for all three
measures: reading time, words generated, and
discrimination. In contrast, the
Interrogation module was most important for Subject 
Bias, where the corresponding three measures had 
significant correlations. 

The differences in learning process profiles among core 
concepts underscore our central claim that core concepts
vary considerably in learning trajectories. 

3.2 Stepwise Regressions and Cross-Validation

We performed analyses on three core concepts that had 
distinctive profiles of correlations. These included 
Objective Scoring, Subject Bias, and Causal Claims. Each 
of these core concepts was analyzed separately using 
stepwise regressions with predictor variables that included 
those with the highest correlations (|r| > .2) with
proportional learning gains. The resulting model was then
cross-validated using a 5-fold procedure with 4 folds for
training and 1 for test.
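The 5-fold procedure can be illustrated with a minimal sketch. It assumes a single predictor fitted by ordinary least squares and reports the mean R² over the training and test folds. It does not implement the stepwise selection step, and the function names, the random fold assignment, and the seed are assumptions rather than the procedure actually used.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variance")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Proportion of variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    if ss_total == 0:
        raise ValueError("outcome has no variance")
    ss_resid = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_resid / ss_total


def cross_validate_r2(xs, ys, k=5, seed=0):
    """Return (mean training R^2, mean test R^2) over k folds."""
    if len(xs) != len(ys) or len(xs) < 2 * k:
        raise ValueError("need equal-length data with at least 2 cases per fold")
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)  # assumed: random fold assignment
    folds = [order[i::k] for i in range(k)]
    train_scores, test_scores = [], []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        train_x, train_y = [xs[i] for i in train], [ys[i] for i in train]
        test_x, test_y = [xs[i] for i in fold], [ys[i] for i in fold]
        slope, intercept = fit_line(train_x, train_y)  # fit on k - 1 folds
        train_scores.append(r_squared(train_x, train_y, slope, intercept))
        test_scores.append(r_squared(test_x, test_y, slope, intercept))
    return sum(train_scores) / k, sum(test_scores) / k
```

With a sample as small as 36 students, each test fold holds only 7 or 8 cases, so the test R² can vary widely between folds and can even be negative when the fitted line predicts worse than the fold mean.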

3.2.1 Objective Scoring of the Dependent Variable

This core concept showed correlations with the
proportional learning gains for the reading times (time on
task measure, r = .32, p < .05) and the multiple choice
questions (discrimination score, r = .32, p < .05) in the
Training module. In all 3 modules, the number of words
generated significantly correlated with proportional
learning gains (Training, r = .42, p < .05; Case Study,
r = .28, p < .05; Interrogation, r = .28, p < .05). When these
significant correlates were entered into a stepwise
regression, the analysis removed the time allocated to
multiple choice questions (time on task) and the words
generated in the Training module, thereby converging on
a model that includes words generated in the Interrogation
module and the Case Study module and the reading times
from the Training module (F(3, 33) = 4.91, R² = .31,
p < .05). In the full model, the words generated in the
Interrogation module had a marginally significant main
effect (F(3, 33) = 3.61, p = .06); the words generated in
the Case Study module did not have a significant main
effect (F(3, 33) = 2.45, p = .13), but reading times were
significant (F(3, 33) = 8.67, p < .05). Given these results, a
second model was created using the generation score for
the Interrogation module and the reading times. The
model was significant (F(2, 34) = 4.338, R² = .20, p < .02)
with a marginally significant main effect for generated
words (F(2, 34) = 3.23, p = .08) and a significant main
effect for reading times (F(2, 34) = 5.45, p < .05). When
this model was cross-validated, the training set accounted
for 26% of the variance (R² = .26), and the test set
accounted for 25% of the variance (R² = .25).

3.2.2 Subject Bias 

For this core concept, the variables with the highest 
correlations with learning gains were the multiple choice 
discrimination score from the Training module (r = .20,
p < .10), and the discrimination (r = .20, p < .1), generation
(r = .33, p < .05), and time on task (r = .26, p < .05)
measures from the Interrogation module. With all
predictors entered into a stepwise regression, the resulting
significant model included only the words per case
(generation) and the discrimination score from the
Interrogation module (F(2, 34) = 3.304, R² = .16, p < .05).
Upon further examination, there was a significant main
effect for generation (F(2, 34) = .498, p < .05) but not for
the discrimination score (F = 1.63, p > .05). A second linear
model with just the generation score was significant
(F(1, 35) = 4.368, R² = .11, p < .05). Next, the model with
only the significant generation predictor was cross-validated
using a 5-fold procedure, resulting in a training set
predicting 8% of the variance (R² = .08) and a test set
predicting 6% of the variance (R² = .06). However, we are
still tentative about drawing strong conclusions from this 
because of the low power in detecting differences in the 
regression. 

3.2.3 Causal Claims 

This core concept had low learning gains (.17) compared 
with the other topics. The two variables with highest 
correlations for learning were discrimination from the 
Case Study module (r = .28, p < .05) and the generation
metric in the Interrogation module (r = .23, p < .1).
However, a follow-up analysis with stepwise multiple
regression was only marginally significant (F(2, 34) =
2.863, R² = .14, p = .07), and cross-validation assessments
were not significant.

4. CONCLUSIONS 

Our analyses revealed very different learning profiles for 
specific core concepts in research methodology. The value 
of the didactic Training module was most pronounced for 
Objective Scoring of Dependent Variables, whereas the 
Interrogation module was most successful for Subject 
Bias, and Case Study was most promising for Causal
Claims. The constructs of time on task, generation of 
information, and discrimination were also quite different 
for the different core concepts. Moreover, students did not 
learn much about differentiating causal from correlational 
claims. This topic may be very abstract to many students, 
difficult to comprehend, and in need of substantially more 
training. 

One important implication of this study is that the 
different core concepts might be assigned different 
modules or a different amount of training allocated to 
each module. For some core concepts, it may be 
sufficient to have students read text and prompt them to
articulate propositions in language. For other core
concepts, students need a large number of case study
examples to apply their knowledge in a discriminating 
fashion. Simply put, training experiences need to be 
optimally allocated to the constraints of content. 

There are a number of limitations in this study that 
prevent us from making more definitive claims about the 
type of training that should be matched to our core 
concepts. The study had a low number of participants and 
a moderate number of missing values for observations. 
However, we can confidently state that correlations
between learning gains and the key constructs of 
generation, discrimination and time on task do vary across 
core concepts of research methodology in 
OperationARIES! It is important to explore different
learning trajectories of specific core concepts in addition 
to differences among students. 